Abstract


We exploit the gap in ability between human and ma- 
chine vision systems to craft a family of automatic chal- 
lenges that tell human and machine users apart via graphi- 
cal interfaces including Internet browsers. Turing proposed 
[Tur50] a method whereby human judges might validate 
“artificial intelligence” by failing to distinguish between 
human and machine interlocutors. Stimulated by the “chat
room problem” posed by Udi Manber of Yahoo!, and influenced
by the CAPTCHA project [BAL00] of Manuel Blum et al. of
Carnegie-Mellon Univ., we propose a variant of the
Turing test using pessimal print: that is, low-quality images
of machine-printed text synthesized pseudo-randomly
over certain ranges of words, typefaces, and image degradations.
We show experimentally that judicious choice of
these ranges can ensure that the images are legible to human
readers but illegible to several of the best present-day
optical character recognition (OCR) machines. Our
approach is motivated by a decade of research on perfor- 
mance evaluation of OCR machines [RJN96,RNN99] and 
on quantitative stochastic models of document image qual- 
ity [Bai92,Kan96]. The slow pace of evolution of OCR 
and other species of machine vision over many decades 
[NS96,Pav00] suggests that pessimal print will defy auto- 
mated attack for many years. Applications include ‘bot’ 
barriers and database rationing. 


Keywords: legibility, document image analysis, OCR 
evaluation methods, document image degradation, hu- 

1 Introduction
1.1 Turing’s Test and Variants 


Alan Turing proposed [Tur50] a method to assess 
whether or not a machine can think, by means of an 
“imitation game” conducted over teletype connections in 
which a human judge asks questions (“challenges”) of 
two respondents — one human and the other a machine — 
and eventually decides which is human; failure to decide 
correctly would be, Turing suggested, convincing evidence 
of artificial intelligence in the machine. Extensions to 
the test have been proposed [SCA00], e.g. to incorporate
behavior richer than teletypes can communicate. Graphical 
user interfaces (GUI) invite the use of images as well as 
text in challenges. 


1.2 Machines Impersonating People 


The world-wide proliferation of GUIs in the 1990’s has
opened up new uses for variants on the Turing test. In 
September 2000, Udi Manber of Yahoo! described this 
“chat room problem” to researchers at Carnegie-Mellon 
Univ.: “bots” (virtual persons) are being let loose in on- 
line chat rooms in an attempt to elicit personal information 
from the human participants — how can they be identified? 
Furthermore, “intelligent agents” are able systematically to 
mine databases that are intended for occasional use by indi- 
vidual humans. There is a growing need for automatic meth- 
ods to distinguish between human and machine users on the 
Web. 

Manuel Blum, Luis A. von Ahn, and John Langford have
articulated [BAL00] desirable properties of such tests, including:

• the test’s challenges can be automatically generated;

• the test can be taken quickly by human users;

• the test will accept virtually all human users (even
young or naive users) with high reliability while rejecting
very few;

• the test will reject virtually all machine users; and

• the test will resist automatic attack for many years even
as technology advances and even if the test’s algorithms
are known (e.g. published and/or released as
open source).


On hearing these, we saw an opportunity to bring to bear 
the well-known and extensively studied gap in image pat- 
tern recognition ability between human and machine vision 
systems. 


1.3 Document Image Quality 


Low-quality images of printed-text documents pose serious
challenges to current image pattern recognition technologies
[RJN96,RNN99]. In an attempt to understand
the nature and severity of the challenge, models of document
image degradations [Bai92,Kan96] have been developed
and used to explore the limitations [HB97] of image
pattern recognition algorithms. The model of [Bai92],
used throughout this study, approximates ten aspects of the
physics of machine-printing and imaging of text, including
spatial sampling rate and error, affine spatial deformations,
jitter, speckle, blurring, thresholding, and symbol size.
Figure 1 shows examples of text images that were synthetically
degraded according to certain parameter settings of
this model.

The reader should be able, with little or no conscious ef- 
fort, to read all these images: so will, we expect, almost ev- 
ery person literate in the Latin alphabet, familiar with the 
English language, and with some years of reading experi- 
ence. The image quality of these cases of “pessimal print” is 
far worse than people routinely encounter, but it’s not quite 
bad enough to defeat the human visual system. 

However, present-day “reading machines” (that is, optical
character recognition, or OCR, machines) are baffled by these
images, as we shall show.


1.4 Design of a Reverse Turing Test 


We propose what we call a “reverse Turing test” of the
following kind. When a user (human or machine)
chooses to take the test (e.g. in order to enter a protected
Web site), a program challenges the user with one synthetically
generated image of text; the user must type back the
text correctly in order to enter. This differs from Turing’s
proposal in at least four ways:

• the judge is a machine, rather than human;

• there is only one user, rather than two;

• the design goal is to distinguish, rather than to fail to
distinguish, between human and machine; and

• the test poses only one challenge (or very few) rather
than an indefinitely long sequence of challenges.

Figure 1. Examples of synthetically generated images of
machine-printed words, in various typefaces and degraded
pseudo-randomly.


(We are grateful for discussions with Manuel Blum in con- 
nection with this design.) 

The challenges must be substantially different almost 
every time, else they might be recorded exhaustively, an- 
swered off-line by humans, and then used to answer future 
challenges. Thus we propose that they be generated pseu- 
dorandomly from a potentially very large space of distinct 
challenges. 
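The challenge-and-response exchange described above can be sketched as follows. This is only an illustration under stated assumptions, not the program used in our experiments: the names ChallengeServer, issue, and verify are hypothetical, the image renderer is omitted (the challenge is represented by an opaque token), answers are assumed to be compared without regard to letter case, and the word list is supplied by the caller.

```python
import secrets


class ChallengeServer:
    """Toy machine judge: one challenge, one typed answer (hypothetical API)."""

    def __init__(self, words):
        # Assumption: words are stored in lower case.
        self._words = list(words)
        self._pending = {}  # token -> expected answer

    def issue(self):
        """Pick a word pseudo-randomly and return an opaque token for it.

        A real system would also render the word as a degraded image and
        send the image (never the word itself) to the user.
        """
        word = secrets.choice(self._words)
        token = secrets.token_hex(16)
        self._pending[token] = word
        return token

    def verify(self, token, typed):
        """Accept only a correct transcription, and only once per token."""
        expected = self._pending.pop(token, None)  # single use, so no replay
        return expected is not None and typed.strip().lower() == expected
```

Discarding each token after one attempt reflects the requirement that recorded challenges should be of no use in answering future ones.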


2 Experiments 


In this design, the essential issue is the choice of the
family of challenges: that is, some broad conditions under
which text-images can be generated that are human-legible
but machine-illegible. We carried out a search for these
conditions with the kind assistance of ten graduate student
and faculty volunteers in the Computer Science Division of
Univ. of California, Berkeley. Our machine subjects were
three of the best present-day commercial OCR systems:
Expervision TR, ABBYY FineReader, and the IRIS Reader.


2.1 Experimental Design 


We synthesized challenges by selecting, pseudo-randomly,
uniformly, and independently:

• a word (from a fixed list);

• a typeface (from a fixed list); and

• a set of image-degradation parameters (from fixed
ranges);

and then generating a single black-and-white image.
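The selection step can be sketched as below. This is a minimal illustration rather than the generator used in the study: the function name sample_challenge and the dictionary layout are hypothetical, the two words and the parameter ranges in the usage lines are placeholders, and the rendering of the image itself is left out.

```python
import random

# Abbreviations of the five typefaces tested in this study.
TYPEFACES = ["TR", "TI", "CO", "PR", "PI"]


def sample_challenge(rng, words, typefaces, param_ranges):
    """Draw one challenge specification.

    The word, the typeface, and each degradation parameter are drawn
    uniformly and independently of one another.
    """
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
    return {
        "word": rng.choice(words),
        "typeface": rng.choice(typefaces),
        "params": params,
    }


# Usage with placeholder inputs (not the study's word list or ranges).
rng = random.Random(2001)
spec = sample_challenge(
    rng,
    ["science", "camera"],
    TYPEFACES,
    {"blur": (0.0, 0.8), "thrs": (0.01, 0.06)},
)
```

Passing the generator in as an argument keeps the sketch reproducible for experiments; a deployed test would presumably want an unpredictable source of randomness instead.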

We selected 70 words according to a set of characteristics
that we believed would favor human recognition over
OCR machines. We restricted ourselves to natural-language
words, since humans recognize words more quickly than
non-word letter strings [TS81]. We chose words occurring
with high frequency on the WWW, so as not to penalize young
or naive users. All words were in English, since English is
the most widely used language on the WWW. All words had
at least five letters and at most eight letters, because shorter
words are few enough to invite an exhaustive template-matching
attack, and longer words are unique enough to invite
a feature-based attack. We used only words with no
ascenders or descenders, since [Spi97] has shown these are
strong cues for recognition even when individual characters
are not reliably recognizable.
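The length and letter-shape criteria lend themselves to a simple filter, sketched below. The letter set is an assumption, since the exact set is not listed here (in particular, treating "i" as having no ascender is our guess), and the names is_candidate and select_words are hypothetical. The frequency criterion is represented only by assuming that the input list is already ordered from most to least frequent.

```python
# Assumption: lower-case letters with neither ascender nor descender.
X_HEIGHT_LETTERS = set("aceimnorsuvwxz")


def is_candidate(word, min_len=5, max_len=8):
    """True if the word has 5 to 8 letters, all without ascenders or descenders."""
    return (
        min_len <= len(word) <= max_len
        and word.islower()
        and set(word) <= X_HEIGHT_LETTERS
    )


def select_words(ranked_words, limit=70):
    """Keep the first `limit` distinct candidates from a frequency-ranked list."""
    picked = []
    for word in ranked_words:
        if is_candidate(word) and word not in picked:
            picked.append(word)
        if len(picked) == limit:
            break
    return picked
```

For example, under these assumptions "science" and "camera" pass the filter, while "print" (descender and ascender) and "scan" (too short) do not.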

In our choice of image degradations, we were guided by
the discussion in [RNN99] of cases that defeat modern OCR
machines, especially:

• thickened images, so that characters merge together;

• thinned images, so that characters fragment into unconnected
components;

• noisy images, causing rough edges and salt-and-pepper
noise;

• condensed fonts, with narrower aspect ratios than
usual; and

• italic fonts, whose rectilinear bounding boxes overlap
their neighbors’.


We explored ranges of values for two of the degradation
parameters, blurring and thresholding (blur and thrs
in [Bai92]; consult this reference for their precise definitions
and for those of the parameters named later). The other
parameters’ values were fixed at sens=0.05, skew=1.5 degrees,
xscl=0.5, and yscl=1.0.

We tested five typefaces: Times Roman (TR), Times 
Italic (TI), Courier Oblique (CO), Palatino Roman (PR), and 
Palatino Italic (PI). Note that xscl=0.5 has the effect of com- 
pressing the images horizontally by a factor of two, simulat- 
ing abnormally condensed variants of these commonly oc- 
curring faces. The type size was fixed at size=8.0 point and 
the spatial sampling rate at resn=400 pixels/inch. 
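To illustrate how blurring followed by thresholding can thin or thicken strokes, the following sketch applies a Gaussian blur, optional per-pixel noise, and a binarization step to a small ink-density image. It is not the model of [Bai92]: the parameters sigma, level, and noise are stand-ins whose units and conventions are our own assumptions and need not match those of blur, thrs, and sens, and the function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma):
    """1-D Gaussian weights normalized to sum to 1; sigma <= 0 means no blur."""
    if sigma <= 0:
        return [1.0]
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve each row with the kernel; pixels off the edge count as blank."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[x + k]
                for k in range(-radius, radius + 1) if 0 <= x + k < n)
            for x in range(n)
        ])
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def degrade(image, sigma, level, noise=0.0, rng=None):
    """Blur an ink-density image (0 = paper, 1 = ink), add Gaussian noise
    per pixel, then binarize: a pixel is black (1) when its noisy density
    is at least `level` (an assumed convention)."""
    rng = rng or random.Random(0)
    kernel = gaussian_kernel(sigma)
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [[1 if v + rng.gauss(0.0, noise) >= level else 0 for v in row]
            for row in blurred]
```

With this convention, a blurred stroke binarized at a high level comes out thinner than the original and at a low level thicker, which is the kind of thinning and thickening discussed above.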


2.2 Experimental Results 


The experiments located ranges of parameter values for
blur and threshold with the desired pessimal-print properties:

• thrs ∈ [0.01, 0.02] for any value of blur; and

• thrs ∈ [0.02, 0.06] and blur = 0.0.

These ranges represent, roughly, two types of images:

• extremely thinned and fragmented images, with a little
salt-and-pepper noise; and

• noisy images (whether thinned or thickened).
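The two ranges can be written as a single predicate, sketched below. The function name in_pessimal_range is hypothetical, and the test on blur uses exact equality with 0.0 only because that is how the second range is stated.

```python
def in_pessimal_range(blur, thrs):
    """True if (blur, thrs) lies in either of the two ranges reported above."""
    thinned = 0.01 <= thrs <= 0.02                      # any value of blur
    unblurred = 0.02 <= thrs <= 0.06 and blur == 0.0    # blur fixed at zero
    return thinned or unblurred
```

For instance, the setting blur = 0.8, thrs = 0.02 satisfies the predicate, while blur = 0.4, thrs = 0.04 and blur = 0.0, thrs = 0.07 do not.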


Of a total of 685 word-images generated within these
ranges, over all five typefaces, all were human-legible, but
all three OCR machines failed on virtually every word, as
illustrated in the following table of data selected from the
experiments.


OCR Machine Accuracy (%), by typeface, for blur ∈
[0.0, 0.8] and thrs = 0.02. (Machines: E-TR = Expervision
TR; A-FR = ABBYY FineReader; I-R = IRIS Reader.)


Face     E-TR     A-FR    I-R    #Words
TR       0.00%    0%      0%     136
TI       0.76%    0%      0%     132
CO       0.66%    0%      0%     152
PR       0.00%    0%      0%     198
PI       0.00%    0%      0%     67
TOTAL    0.29%    0%      0%     685
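As a consistency check on the table, the percentages for E-TR can be reproduced from whole-word counts. The counts of correctly read words below are inferred from the percentages, not reported data: one correct word each for TI and CO is the smallest count that yields the tabulated figures.

```python
# Word-image counts per typeface, from the table.
words = {"TR": 136, "TI": 132, "CO": 152, "PR": 198, "PI": 67}

# Inferred (not reported): correctly read words for E-TR per typeface.
etr_correct = {"TR": 0, "TI": 1, "CO": 1, "PR": 0, "PI": 0}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to two decimals as in the table."""
    return round(100.0 * correct / total, 2)


total_words = sum(words.values())
total_correct = sum(etr_correct.values())
per_face = {face: accuracy_pct(etr_correct[face], words[face]) for face in words}
overall = accuracy_pct(total_correct, total_words)
```

Under this reading, E-TR recognized 2 of the 685 word-images, which matches the tabulated total of 0.29%.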


What is more, on almost all of these words, no machine
guessed a single alphabetic letter, either correctly or incorrectly.
The following figures show a selection of
machine-illegible and machine-legible word images.


Figure 2. Examples of synthetically generated images of
printed words which are machine-illegible due to degradation
parameter thrs = 0.02.


Figure 3. Examples of synthetically generated images of
printed words which are machine-legible due to a slightly
better thrs = 0.07.


Each OCR machine’s performance was sensitive to slight
changes in the parameters. For example, one machine’s
accuracy dropped from 40-50% to 0% when thrs fell from
0.04 to 0.02 (for blur = 0.8). Also, it dropped from 28%
to 0% when blur fell from 0.4 to 0.0 (at thrs = 0.04); this
change is barely perceptible to the eye. Such fragility
(abrupt catastrophic failure) is typical of many machine
vision systems attempting to operate at the margins of good
performance.


3 Discussion 


Our familiarity with the state of the art of machine vi- 
sion leads us to hypothesize that not only these three but 
all modern OCR machines will not be able to cope with the 
image degradations in the ranges we have identified. Also, 
we are confident that wider ranges, involving other degradation
parameters and other typefaces, exhibiting pessimal-print
properties, can be found through straightforward experiment.
Blum et al. [BAL00] have experimented, on their
website www.captcha.net, with degradations that are
not only due to imperfect printing and imaging, but in- 
clude color, overlapping of words, non-linear distortions, 
and complex or random backgrounds. The relative ease with 
which we have been able to generate pessimal print, and the 
diversity of other means of bafflement ready to hand, sug- 
gest to us that the range of effective text-image challenges 
at our disposal is usefully broad. 

How long can a reverse Turing test based on pessimal
print resist attack, given a serious effort to advance
machine-vision technology, and assuming that the design
principles of the test (perhaps even its source code)
are known to attackers? Even given such major hints as the
dictionary of words, the nature of the distortions, the fonts,
sizes and other considerations, a successful attack would
probably require substantially more real time than humans
need, at least for the near future. A statistic today suggests about
200 msec per comparison between isolated handprinted digits,
using fast workstations of 2001; many comparisons
over a far larger set would be needed to solve this problem.
Also, our investigations and the CMU CAPTCHA project
are continuing, and so any specific attacks might be thwarted
in the same kind of arms race seen in cryptography.
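A rough sense of the cost of an exhaustive template-matching attack can be had from a back-of-envelope calculation. Everything here beyond the 200 msec figure and the counts of words and typefaces is an assumption: in particular, that the per-comparison time for handprinted digits carries over to word images, and that an attacker would store a hypothetical 100 degraded templates per word and typeface. The function name is likewise hypothetical.

```python
def brute_force_seconds(n_words, n_typefaces, n_degradations,
                        sec_per_comparison=0.2):
    """Time to compare one challenge image against every stored template."""
    return n_words * n_typefaces * n_degradations * sec_per_comparison


# 70 words and 5 typefaces as in our experiments; 100 templates per
# word/typeface pair is a hypothetical figure chosen for illustration.
estimate = brute_force_seconds(70, 5, 100)
hours = estimate / 3600.0
```

Under these assumptions a single challenge would take roughly 7,000 seconds, or nearly two hours, which is far longer than a human reader needs.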


A close study of the history of image pattern recognition 
technology [Pav00] and of OCR technology [NS96] in par- 
ticular suggests to us that the gap in ability between human 
and machine vision is wide and is only slowly narrowing. 
We notice that few, if any, machine vision technologies have 
simultaneously achieved all three of these desirable char- 
acteristics: high accuracy, full automation, and versatility. 
Versatility — by which we mean the ability to cope with a 
great variety of types of images — is perhaps the most in- 
tractable of these, and it is the one that pessimal print, with 
its wide range of image quality variations, challenges most 
strongly. 


An ability gap exists for other species of machine vision,
of course, and in the recognition of non-text images, such as
line-drawings, faces, and various objects in natural scenes.
One might reasonably intuit that these would be harder for
machines and so decide to use them rather than images of text.
This intuition is not supported by the Cognitive Science
literature on human reading of words. There is no consensus on
whether recognition occurs letter-by-letter or by a word-template
model [Cro82,KWB80]; some theories stress the importance
of contextual clues [GKB83] from natural language
and pragmatic knowledge. Furthermore, almost all research
on human reading has used perfectly formed images of text:
no theory has been proposed for mechanisms underlying the
human ability to read despite extreme segmentation (merging
and fragmentation) problems. The resistance of these
problems to technical attack for four decades and the
incompleteness of our understanding of human reading abilities
suggest that it is premature to decide that the recognition
of text under conditions of low quality, occlusion, and
clutter is intrinsically much easier (that is, a significantly
weaker challenge to the machine vision state-of-the-art)
than recognition of objects in natural scenes.


There are other, pragmatic, reasons to use images of text 
as challenges: the correct answer is unambiguously clear; 
the answer maps into a unique sequence of keystrokes; and 
it is straightforward automatically to label every challenge, 
even among hundreds of millions of distinct ones, with its 
answer. These advantages are lacking, or harder to achieve, 
for images of objects or natural scenes. 


It might be good in the future to locate the limits of human
reading in our degradation space: that is, at what point
do humans find degraded words unreadable? Does human
performance decay smoothly, or does it show the same kind of
“falling off a cliff” phenomenon as machines, only at another level?


4 Conclusions 


We have designed, built, and tested a “reverse” Turing
test based on “pessimal print” and shown that it has the
potential of offering a reliable, fast, and fully automatic
method for telling human and machine users apart over
GUIs. It is an amusing irony, one that we would like to
believe Alan Turing himself would have savored, that a
problem (machine reading) which he planned to attack and
which he expected to yield easily has instead resisted solution
for fifty years and now is poised to provide a technical
basis for the first widespread practical use of variants of his
proposed test for human/machine distinguishability.